Data Dredging

Jon Wayland

9/5/2023

Note: I originally wrote this on Quora in 2019.

Imagine a marketing agency that, while trying to prove its business is worth it, asks an analyst (let’s call her Sasha) to “prove” its return on investment.

The ask goes something like this:

“I’d like to be able to say that there is a significant correlation between our involvement index and our clients’ return on investment, so that we can attribute their success to us and not to their other marketing affiliates.”

Assume the involvement index is standardized across all marketing vendors, and that the request will be answered simply by measuring the correlation between the index and ROI.

Sasha begins by examining the relationship across all clients:

The data gives no indication of a relationship, let alone a positive one.
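To make this first check concrete, here is a minimal sketch in Python. The original client data is not available, so the data below is synthetic and purely hypothetical, and the names `pearson_r`, `involvement`, and `roi` are chosen only for illustration. ROI is generated independently of the index, so the correlation should land close to zero.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical clients: ROI is drawn independently of the involvement index.
involvement = [rng.uniform(0, 100) for _ in range(500)]
roi = [rng.gauss(10, 5) for _ in range(500)]

r_all = pearson_r(involvement, roi)
print(f"correlation across all clients: {r_all:.3f}")
```

A coefficient near zero is what an honest analysis reports when the two quantities are unrelated.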

In an effort to please her leadership, Sasha decides to start making some assumptions. She:

removes clients with less than one year of business
removes clients with more than 3.5 years of business
removes clients with market share in Washington, D.C.
removes clients in the automotive industry
removes clients using the marketing vendor with the largest market share
removes clients founded in 1979

After applying these assumptions, Sasha revisits the relationship between ROI and the involvement index:

This looks better. Much better. Sasha reports back to leadership and shows them the good news.

The agency decides to give her a promotion, and proceeds to advertise the wonderful impact it has on its clients in order to win new business.

The effort to please her leadership is where Sasha committed the unethical act of data dredging.

Rather than testing a predefined hypothesis, Sasha decided in advance what her outcome was going to be and then tampered with the data until the data supported it.

She made assumptions that had no statistical justification in her analysis, solely so that she could find the subset of clients that tells a positive story.
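The effect is easy to reproduce on pure noise. The sketch below is an illustration under stated assumptions, not a reconstruction of Sasha's actual steps: the clients are synthetic, the attribute names and values are hypothetical, and the greedy search (apply whichever exclusion rule raises the correlation most, up to six times) is simply one way to automate the same kind of cherry-picking.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
INDUSTRIES = ["automotive", "retail", "finance", "health", "tech"]  # hypothetical
REGIONS = ["DC", "NY", "CA", "TX", "WA"]                            # hypothetical

# Hypothetical clients: ROI is pure noise, unrelated to the involvement index.
clients = [
    {
        "involvement": rng.uniform(0, 100),
        "roi": rng.gauss(10, 5),
        "industry": rng.choice(INDUSTRIES),
        "region": rng.choice(REGIONS),
        "founded": rng.randint(1970, 1989),
    }
    for _ in range(300)
]

def corr(rows):
    """Correlation between involvement index and ROI for a list of clients."""
    return pearson_r([c["involvement"] for c in rows], [c["roi"] for c in rows])

# Candidate exclusion rules: "remove every client whose attribute equals value".
rules = (
    [("industry", v) for v in INDUSTRIES]
    + [("region", v) for v in REGIONS]
    + [("founded", year) for year in range(1970, 1990)]
)

kept, applied = clients, []
for _ in range(6):  # at most six filters, mirroring the six in the story
    best_rule, best_rows, best_r = None, kept, corr(kept)
    for attr, value in rules:
        subset = [c for c in kept if c[attr] != value]
        if len(subset) < 30:  # keep enough clients for the statistic to be defined
            continue
        r = corr(subset)
        if r > best_r:  # keep only the rule that flatters the result most
            best_rule, best_rows, best_r = (attr, value), subset, r
    if best_rule is None:
        break
    kept, applied = best_rows, applied + [best_rule]

r_all, r_dredged = corr(clients), corr(kept)
print(f"all clients:   r = {r_all:.3f} (n = {len(clients)})")
print(f"after filters: r = {r_dredged:.3f} (n = {len(kept)})")
print("filters applied:", applied)
```

Because ROI was generated independently of the index, any increase in the correlation after filtering is an artifact of the selection, not evidence of a real relationship.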